Statistical modelling of MT output corpora for Information Extraction
Abstract
The output of state-of-the-art machine translation (MT) systems could be useful for certain NLP tasks, such as Information Extraction (IE). However, some unresolved problems in MT technology could seriously limit the usability of such systems. For example, robust and accurate word sense disambiguation, which is essential for the performance of IE systems, has not yet been achieved by commercial MT applications. In this paper we develop an evaluation measure for MT systems that could predict their usability for IE tasks such as scenario template filling or automatic acquisition of templates from texts. We focus on words that are statistically significant for a text in a corpus, which are already used for some IE tasks such as automatic template creation (Collier, 1998). Their general importance for IE was also substantiated by our material, where they often include named entities and other important candidates for filling IE templates. We suggest MT evaluation metrics that are based on comparing the distribution of statistically significant words in corpora of MT output and in corpora of human reference translations. We show that there are substantial differences in such distributions between human translations and MT output, which could seriously distort IE performance. We compare different MT systems with respect to the proposed evaluation measures and examine how these measures relate to other MT evaluation metrics. We also show that the suggested statistical model can highlight specific problems in MT output that are related to conveying factual information. Dealing with such problems systematically could considerably improve the performance of MT systems and their usability for IE tasks.
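The comparison described in the abstract can be illustrated with a small sketch. The abstract does not give the scoring formula, so the code below substitutes a plain tf-idf weight as a stand-in for "statistical significance"; the function names `significant_words` and `overlap_scores`, the `top_k` cutoff, and the precision/recall summary are all assumptions made for illustration, not the paper's actual measure.

```python
import math
from collections import Counter


def significant_words(corpus, top_k=3):
    """Return, for each text, the set of its top_k highest-scoring words.

    corpus: list of texts, each a list of lowercase tokens.
    ASSUMPTION: tf-idf stands in for the paper's significance score,
    which the abstract does not specify.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each word once per text

    result = []
    for tokens in corpus:
        tf = Counter(tokens)
        scores = {
            w: (c / len(tokens)) * math.log(n_docs / doc_freq[w])
            for w, c in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        # Words occurring in every text score 0 and are dropped.
        result.append({w for w in ranked[:top_k] if scores[w] > 0})
    return result


def overlap_scores(mt_corpus, ref_corpus, top_k=3):
    """Compare significant words of MT output with those of the reference.

    Texts are paired by position. Returns (precision, recall) of the MT
    significant words measured against the reference significant words.
    """
    mt_sig = significant_words(mt_corpus, top_k)
    ref_sig = significant_words(ref_corpus, top_k)
    matched = sum(len(m & r) for m, r in zip(mt_sig, ref_sig))
    mt_total = sum(len(m) for m in mt_sig)
    ref_total = sum(len(r) for r in ref_sig)
    precision = matched / mt_total if mt_total else 0.0
    recall = matched / ref_total if ref_total else 0.0
    return precision, recall


# Toy data: the MT output mistranslates "bank" as "shore" in the first text.
reference = [["bank", "raised", "rates", "the"],
             ["court", "ruled", "the", "case"]]
mt_output = [["shore", "raised", "rates", "the"],
             ["court", "ruled", "the", "case"]]
precision, recall = overlap_scores(mt_output, reference)
```

In this toy setup a single word sense error removes one significant word from the match, lowering both scores below 1.0, which is the kind of distortion the abstract argues could affect IE template filling.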
Publication date: 2003